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ABSTRACT 

Wc present a multi-class neural network (NN) classifier as a method to measure non- 
Gaussianity, characterised by the local non- linear coupling parameter /nl, in maps 
of the cosmic microwave background (CMB) radiation. The classifier is trained on 
simulated non-Gaussian CMB maps with a range of known /nl values by providing it 
with wavelet coefficients of the maps; we consider both the HealPix (HW) wavelet 
and the spherical Mexican hat wavelet (SMHW). When applied to simulated test maps, 
the NN classfier produces results in very good agreement with those obtained using 
standard \ 2 minimization. The standard deviations of the /nl estimates for WMAP- 
like simulations were a — 22 and a = 33 for the SMHW and the HW, respectively, 
which are extremely close to those obtained using classical statistical methods in Curto 
et al. and Casaponsa et al. Moreover, the NN classifier does not require the inversion 
of a large covariance matrix, thus avoiding any need to regularise the matrix when it 
is not directly invertible, and is considerably faster. 
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1 INTRODUCTION 

Artificial intelligence algorithms are being used increasingly 
to improve the efficiency of computationally intensive data 
analysis. In particular, neural networks (NN) have been suc- 
cessfully applied to pattern recognition, classification of ob- 
jects and parameter est imation in a num ber of fields, includ- 
ing cosmology (see e.g. lAuld et alj|2007f ). 

Cosmological analysis typically involves the use of large 
datasets and high precision numerical tools. In particular, 
the study of deviations from Gaussianity in the distribu- 
tion of temperature anisotropies in the cosmic microwave 
background (CMB) require very demanding computational 
methods. The simplest way to characterise such a deviation 
is through third order moments, as these vanish in the Gaus- 
sian case. It is now commonplace in CMB analysis to work in 
spherical harmonic space, where computing the three point 
correlation function or bispectrum can prove difficult, or in- 
deed impossible, due to numerical instability. Some recent ef- 
forts have been applied to lessen the computational demand 
without reducing e fficiency; see for exam ple the KSW bis- 
pectrum estimator l|Komatsu et al.|[2005t ). or the binned esti- 
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mator (|Bucher et alj|20ich . Other methods which have also 
been applie d to non-Gaussianity analysis include M inkowski 
functionals (Iffikage et al.ll2008l; iNatoli et alJl201Ch, wavelet- 
based methods (ICavon et al.l l200ll; iMukheriee fc Wane 



2004ICurto et al.l2009allbl:lPietrobon 2010; Casaponsa et al 



2010), a Bayesian approach (Eis ner fc Wandeltl 2010) and 



the a nalysis of the A-dim ensional probability density func- 
tion l|Vielva fc Sanzll2010h . 

This paper introduces an approach based on a neural 
network classifier which, after training on third order mo- 
ments of wavelet coefficients derived from simulated Gaus- 
sian and non-Gaussian CMB realisations, can be used to es- 
timate the presence and degree of non-Gaussianity for any 
given data map. We have chosen to estimate the local non- 
linear coupling parameter /nl , which parameterises the local 
non-Gaussianity as a quadratic term in the primordial cur- 
vature perturbation. More precisely, /jvl is the amplitude of 
the corrections at second order of the primordial curvature 
perturbations dSalopek fc Bondl [l 990; Gan gui et al.l Il994l ; 
IVerde et al.ll2000l ; iKomatsu fc Spergelll200lD . This type of 
non-Gaussianity is predicted even in the simplest slow-roll 
inflationary scenario, albeit at a very low level /nl < 1, 
whereas a wide range of non-standard inflationary models 
predict much larger typical /nl values (for a more com- 
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plet e review see iBartolo et al.1 l|2004MBabich et all (|2004l ') 
and lYadav fc WandeltJ <|201oF )7 Estimating the value of /nl 
from a given data map using existing methods typically has 
a high computational cost and usually numerical problems 
arise (e.g. matrix inversion). As we will show, the use of 
neural networks bypasses these difficulties. 

In principle, one could use the pixel temperatures in 
the CMB map directly, or its spherical harmonic coefficients, 
as the inputs to the neural network classifier. Nonetheless, 
we perform a pre-processing step in which we decompose 
the temperature maps into their wavelet coefficients, which 
have shown themselves to be sensitive to non-Gaussian sie 



nals Ce.g. lCurto et al.ll2009bl . l2011al : ICasaponsa et afll201C 
In particular, we consider the HealPix wavelet (HW) and 
a spherical Mexican hat wavelet (SMHW), and compute 
third-order moments of these wavelet coefficients, the mean 
value of which is proportional to /nl- The network is then 
trained so that when presented with these cubic statistics 
for a new (data) map, it can estimate the /nl value and 
its error bar. We apply this method to estimate the degree 
of non-Gaussianity in the Wilkinson microwave anisotropy 
probe (WMAP) 7-year data release. 

This paper is organized as follows. In Section we 
give a brief introduction to the wavelet analysis used. An 
overview of the type of neural network employed and our 
training procedure follows in Section [3] In Section [4] we 
explain the generation of our Gaussian and non-Gaussian 
simulations, and the specific characteristics of our /nl clas- 
sification network. We present the results of applying our 
classifier to simulations and to WMAP 7-year data in Sec- 
tion [5] Our conclusions are summarised in Section [6] 



2 WAVELETS 

Wavelet methods have seen increasing usage in cosmol- 
ogy. This has been particularly marked in CMB non- 
Gaussianity analyses, in which competitive resu lts have been 
obtai n ed using wav e lets s u ch as the SMHW on et al.l 



IVielva et al.1 12004 ICruz et al.1 12005]; 



Curto et al.l 



directional spherical wavel ets (McEwcn et al 



20081). spherical Haa r wavelet (SHW) (|Tenorio et al.1 1 19991 ; 
Barreirp et al.l 120001) . and re cently the HealPix wavelet 
(HW) (|Casaponsa et al.ll2010h . For a review of wave l ets ap - 
plied on the sphere, see, for example. iMcEwen et al.l l|2007h . 
In essence, decomposing a CMB map into its wavelet coeffi- 
cients allows one to separate the search for non-Gaussianity 
on different length-scales, while retaining positional informa- 
tion. In this section we will briefly discuss the characteristics 
of both the HW and SMHW and describe how we construct 
the statistics which are used in our analysis. 



2.1 HealPix wavelet 

The HealPix wavele t is very similar to that presented by 
IShahram et al.l (|2007T l . ICasaponsa et al.l (|2010t ) have used a 
revised version of this wavelet and perform the first cos- 
mological application. In both papers, the central idea is 
the construction of a fast wa velet method adap ted to the 
HealPix pixelization scheme (|G6rski et al-l feoOSl . The HW 
is similar to the SHW in the sense that, at each level of 



the wavelet transform, one produces both a high- and low- 
resolution map. The low-resolution map for the HW is ob- 
tained simply by averaging over 4-pixel blocks, and the 
high-resolution map is just the original map minus the low- 
resolution map. One begins with the original map at reso- 
lution J — 9 (iV s idc = 512) and performs successive wavelet 
decompositions until resolution J = 2 (N S id e = 2), thereby 
constructing 7 sets of high- and low-resolution maps. Al- 
though the original map is fully represented by the 7 high- 
resolution maps plus the low-resolution map at J = 2, in 
our analysis we have used all the high- and low-resolution 
maps, plus the origi nal map, since this ha s been shown to 
improve results (see ICasaponsa et al,ll201ol . for details). 

Using all these maps, the third order moments of the 
wavelet coefficients are computed as follows: 



1 }~,i=i Wi,jWi,fcWi,i 



Ni 



(1) 



where Wij is the i 4 wavelet coefficient of the map at res- 
olution j, Gj is the dispersion of Wij, and Ni is the num- 
ber of pixels in the map at resolution I (since one requires 
j ^ k ^ I). Some of these statistics are redundant (linearly 
dependency exists between them), so we restrict our analy- 
sis to the set of non-redudant statistics, which gives a total 
of 232 quantities; these are then computed for non-Gaussian 
simulations with a range of known values of /nl ■ 

The expected values of these statistics are proportional 
to the non-linear coupling parameter, and they have pre- 
viously been used to estimate the best fit /nl value for 
the data by weighted least squares parameter estimation 
l|Casaponsa et ai1l2010h . In this case, the dispersion in the 
estimated /nl value for Gaussian simulations and is found 
to be ct(/nl) = 34, which is slightly larger that the optimal 
value. The main advantage of the HW is the computing ef- 
ficiency; for example, the third-order statistics construction 
is 10 3 times faster th an for the KSW bispectrum estimator 
l|Komatsu et al.ll2005h and 10 2 times faster than the SMHW 
(see below). This procedure (for both the HW and SMHW) 
does, however, include the estimation and inversion of a cor- 
relation matrix, which can be computationally demanding 
and, in some cases, close to singular. As we will show below, 
this step is avoided with the use of a NN classifier. 



2.2 Spherical Mexican Hat Wavelet 

The spherical Mexica n h at wavelet (SMHW) 

llAntoine fc Vanderghevnstl 1 19981 ; iMartmez-Gonzalez et all 
2002) has produced competiti ve results in const r aining 
primordial non-G a ussianit y (|Mukheriee fc Wand |2004| ; 
ICurto et al.ll2009al lbl,l 2011al ). It is a continuous wavelet that 
has better frequency localization than the HW, alth ough 
the computing efficiency is lower. ICurto et al.l (|2011al ) use 
the SMHW to constrain /nl with an accurac y equivalent 
to th e bispectrum estimator s (see f or example ISmith et al.l 
20091; IFergusson fc Shellardl 120091 IFergusson et al.1 |2010| : 



Komatsu et al.ll201ll ; iBucher et alll2010l ). The definition of 
the third-order moments is the same as for the HW. In this 
case, however, there are more inter-scale combinations be- 
cause the scales involved are not restricted by the HealPix 
pixelization. The total number of non-redundant statistics 
for the SMHW wavelet coefficients is 680. Using the mean 
values and covariances of these statistics computed from 
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Figure 1. Schematic of a 3-layer feed- forward neural network. 



non-Gaussian simulations, ICurto et al.l (|201 lal ) applied a 
X 2 minimisation method to obtain optimal uncertainties 
on the /nl estimates of <r fa 21. However, this method 
requires a principal component analysis to deal with the 
degenerancies present in the covariance matrix. As we will 
see, this problem is avoided with the use of the multi-class 
neural network classifier. 



3 NEURAL NETWORKS 

Artificial neural networks are a methodology for comput- 
ing, based on massive parallelism and redundancy, which 
are features also found in animal brains. They consist of 
a number of interconnected processors each of which pro- 
cesses information and passes it to other processors in the 
network. Well-designed networks are able to 'learn' from 
a set of training data and to make predictions when pre- 
sented with new, possibly incomplete, data. These algo- 
rithms have been successfully applied in several areas, in par- 
ticular, we note the fol lo wing a pplications in astro p hysics : 
Storrie-Lombardi et al l (Il992l); Baccigalupi et all f2 000); 



Vanzella et all |2004 ); lAuld et alj \200H ) and lCarballo et all 
(2QQg). 

The basic building block of an ANN is the neuron. In- 
formation is passed as inputs to the neuron, which processes 
them and produces an output. The output is typically a sim- 
ple mathematical function of the inputs. The power of the 
ANN comes from assembling many neurons into a network. 
The network is able to model very complex behaviour from 
input to output. We use a three-layer feed- forward network 
consisting of a layer of input neurons, a layer of 'hidden' 
neurons and a layer of output neurons. In such an arrange- 
ment each neuron is referred to as a node. Figure [T] shows a 
schematic design of such a network. 

The outputs of the hidden layer and the output layer 
are related to their inputs as follows: 

hidden layer: h 3 = gW(fV); /f > = 5>£>a: 4 + 6^,(2) 

i 

output layer: y k = g&> (/f ) ; /f = ]T w% hj + (3) 

3 

where the output of the hidden layer h and output layer y are 
given for each hidden node j and each output node k. The 



index i runs over all input nodes. The functions and g' 2 ' 
are called activation functions. The non-linear nature of gr' 1 ' 
is a key ingredient in constructing a viable and practically 
useful network. This non-linear function must be bounded, 
smooth and monotonic; we use g ( - 1 \x) = tanhx. For 
we simply use g^ 2 '(x) = x. The layout and number of nodes 
are collectively termed the architecture of the network. For 
a basic intro duction to artific ial neural networks the reader 
is directed to iMacKavl (|2003l ). 

For a given architecture, the weights w and biases b de- 
fine the operation of the network and are the quantities we 
wish to determine by some training algorithm. We denote 
w and b collectively by a. As these parameters vary during 
training, a very wide range of non-linear mappings between 
inputs and outputs is possible. In fact, according t o a 'uni- 
versal approximation theorem ' iLeshno et ail (l993), a stan- 
dard three-layer feed-forward network can approximate any 
continuous function to any degree of accuracy with appro- 
priately chosen activation functions and a sufficient number 
of hidden nodes. 

In our application, we will construct a classification net- 
work. The aim of any classification method is to place mem- 
bers of a set into groups based on inherent properties or fea- 
tures of the individuals, given some pre-classified training 
data. Formally, classification can be summarised as finding 
a classifier C : x — > C which maps an object from some (typ- 
ically multi-dimensional) feature space x to its classification 
label C, which is typically taken as one of {1, N} where 
iV is the number of distinct classes. Thus the problem of 
classification is to partition feature space into regions (not 
necessarily contiguous), assigning each region a label corre- 
sponding to the appropriate classification. In our context, 
the aim is to classify sets of third-order statistics of wavelet 
coefficients of (possibly) non-Gaussian CMB maps (assem- 
bled into an input feature vector x) into classes defined by 
ranges of /nl; this is discussed in more detail below. 

In building a classifier using a neural network, it is con- 
venient to view the problem probabilistically. To this end we 
consider a 3-layer MLP (multi-layer percepton) consisting 
of an input layer (xi), a hidden layer (hj), and an output 
layer (yi). In classification networks, however, the outputs 
are transformed according to the softmax procedure 



Pk 



V e y m ' 



(4) 



such that they are all non-negative and sum to unity. In 
this way p k can be interpreted as the probability that the 
input feature vector x belongs to the kih class. A suitable 
objective function for the classification problem is then 



£ ( a ) = EE4°lnp*(x (i) ,a) 



(5) 



where the index I runs over the training dataset T> — 
{x^,t^}, in which the target vector for the network 
outputs has unity in the element corresponding to the true 
class of the I th feature vector x"' and zeroes elsewhere. One 
then wishes to choose network parameters a so as to max- 
imise this objective function as the training progresses. The 
advantage of this probabilistic approach is that we gain the 
ability to make statistical decisions on the appropriate clas- 
sification in very large feature spaces where a direct linear 
partition would not be feasible. 
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One wishes to choose network parameters a so as to 
maximise the objective function £(a) as the training pro- 
gresses. This is, however, a highly non-linear, multi-modal 
function in many dimensions whose optimisation poses a 
non-trivial problem. We perform this op timisation using 
the MemSys package (|Gull fc Skillindl 19991) . This algorithm 
considers the parameters a to have prior probabilities pro- 
portional to e" 5 *- 11 - 1 , where S(&) is the pos itive-negative en- 
tropy functional dHobson fc La senbv 1998). a is treated as a 
hyper-parameter of the prior, and sets the scale over which 
variations in a are expected, a is chosen to maximise its 
marginal posterior probability whose value is inversely pro- 
portional to the standard deviation of the prior. Thus for 
a given a, the log-posterior probability is proportional to 
£(a) + aS(a). For each chosen a there is a solution a that 
maximises the posterior. As a varies, the set of solutions a 
is called the maximum- entropy trajectory. We wish to find 
the solution for which £ is maximised which occurs at the 
end of the trajectory where a = 0. For practical purposes we 
start at a large value of a and iterate downwards until a is 
sufficiently small so that the posterior is dominated by the 
£ term. MemSys performs this algorithm using conjugate 
gradient descent at each step to converge to the maximum- 
entropy trajectory. The required matrix of second deriva- 
tives of £ is approximated using vector routines only, thus 
circumventing the need for 0(N 3 ) operations required for 
exact calculations. The application of MemSys to the prob- 
lem of network training allows for the fast efficient train- 
ing of relatively large network structures on large data sets 
that would otherwise be difficult to perform in a reason- 
able time. Moreover the MemSys package also computes 
the Bayesian evidence for the model (i.e. n etwork) under 
consideration, (see for example |Javnesll2003l . for a review), 
which provides a powerful model selection tool. In principle, 
values of the evidence computed for each possible architec- 
ture of the network (and training data) provide a mechanism 
to select the most appropriate architecture, which is simply 
the one that maximises the evidence (although we will use a 
more prosaic method below for deciding on the network ar- 
chitectur e). The MemSys algo rithm is described in greater 
detail in (|Gull fc Skillinell 19991 ) . 



4 THE F NL CLASSIFICATION NETWORK 

To train our /nl classification network we provide it with an 
ensemble of training data V = {x (i) , t (i) }. The I th input vec- 
tor x"' contains the third-order statistics of the wavelet co- 
efficients of the I th simulated CMB map. The output classes 
of our network correspond to contiguous ranges of /nl val- 
ues. Thus, the target vector for the network outputs has 
zeroes everywhere except for a unit entry in the element cor- 
responding to the class in which the true /nl value of the 
I th simulated CMB map falls. 

The N output classes of the network were defined by 
dividing some initial (anticipated) range of /nl values into 
iV equal-width subranges. For example, for a total range of 
—30 ^ /nl < 30 and a network with just 3 output classes, 
input vectors constructed from maps with —30 ^ /nl < — 10 
were ascribed to class=l with an associated target vector 
t = (1, 0, 0), maps with — 10 ^ /nl < 10 to class— 2 with t = 
(0, 1, 0), and those with 10 < /jvl < 30 to class=3 with t = 



(0,0, 1). In this example, the output given by the network 
for some test input vector x would be a 3-dimensional vector 
p = (pi,p2,P3), where ~}2 k Pk = 1 and p k can be interpreted 
as the probability that the input vector belongs to class k. 
The discrepancy between the targets and the outputs during 
training can be measured by the true positive rate, which is 
simply the fraction of the training input vectors for which 
the network assigns the maximum probability to the correct 
class. 

From the output values p k obtained for each map, we 
define the estimator of the local non-Gaussianity parameter 
to be simply 

"class 

/NL = ^2 (fNL)kPk (6) 
fc=l 

where (/nl)* is the mean value of /nl in the k th class. The 
statistical properties of this estimator, namely its mean and 
dispersion, determine the accuracy of the method. 



4.1 Training data 

The training input vectors x*-'' were generated as fol- 
lows. We began with a set of 1000 non-Gaussian CMB 
realisations from which a^m and aff m were generated by 
lElsner fc Wandeltl (120091 ) and normalized to the WMAP 
7-year concordance model power spectrum generated by 
CAMB. These a; m are publicly available^. The ultimate ac- 
curacy of the network classifier is improved, however, by the 
inclusion of further training data. Given the finite number 
of available simulations, we thus created a further set by 
rotation of the original maps by 90° perpendicular to the 
galactic plane. This rotation creates roughly 20 per cent ex- 
tra map area based on the original mask; we verified that 
its inclusion improves the results. Using this procedure we 
generate a further 1000 non-Gaussian simulations. Of the 
2000 non-Gaussian maps, 1800 were used for training and 
the remainder were set aside for testing of the networks. 

For each non-Gaussian simulation used for training, sets 
of aim were then generated with varying /nl using the fol- 
lowing prescription 

G | r NG i„\ 
0-lm — a im + /NLClim , (l) 

with 20 different /nl random values between —120 and 120 
for the HW decomposition and between — 76 and 76 for the 
SMHW analysis. Thus, for each non-Gaussian simulation, 20 
sets of aim were generated. Hence the total number of avail- 
able training data sets is 36000. Noise-weighted V+W-band 
WMAP real i zations were then constructed as ex plained in 
ICurto et ail (1200981 ) and ICasaponsa et all (|2010l ). and the 
KQ75 mask was then applied, which covers roughly 29% 
of the sky. A wavelet decomposition for both the HW and 
SMHW was performed to determine the wavelet coefficents 
for each a; m set, and their third-order moments computed. 
These statistics were provided as inputs to the neural net- 
work. Each input vector contained 232 values for the HW 
and 680 for the SMHW. 



1 http:/ /planck. mpa-garching.mpg.de/cmb/fnl-simulations/ 
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4.2 Network architecture 

The architecture of our 3-layer neural networks are defined 
by two free parameters: the number of hidden nodes rihid 
and the number of output classes, n c i ass , into which the /nl 
range is divided. A further parameter, which determines the 
accuracy of the network classifier, is the quantity of train- 
ing data ridata- Variation of these parameters can affect the 
training efficiency so it is desirable to explore this training 
space adequately in order to find an optimal set of parame- 
ters. 

Although the MemSys algorithm provides routines to 
determine the optimal value of the number of hidd en nodes 
using the Bayesian evidence iGull fc Skilllnl l| 19991 ). in this 
application rihid is determined simply by measuring training 
times and the accuracy of the trained networks on an in- 
dependent testing set. In this example, we have found that 
in fact the optimal architecture contains no hidden nodes, 
resulting in what is effectively a linear classifier. This is not 
surprising, since we are effectively 'asking' the network to 
learn the mean value and dispersion of the third-order mo- 
ments of the wavelet coefficients for each /jvl ; since the ex- 
pectation value is linearly dependent on the /nl, this net- 
work architecture trivially satisfies this requirment. Indeed, 
networks of this sort provide a simple way of obtaining the 
(pseudo)inverse of any matrix. 

The number of output classes, n c i ass , of the network 
is clearly related to the total range of /nl considered and 
size of the subranges into which this range is divided. Here 
we consider networks with n c i ass = 9 (an odd number en- 
sures that /nl = does not lie on the boundary of a class) 
The range of /nl was chosen a priori to correspond to 
approximately ±3cr, where a is the dispersion in the /nl 
estimates obtained previously using the standard x 2 min- 
imisation method. Thus, the full range was taken to be 
-120 s; /nl < 120 for the HW and -76 /nl < 76 for 
the SMHW, resulting in subranges per class of width 27 
and 17 units, respectively. This combination fulfilled all the 
requirements of classification over the range of our simulated 
data. 

The quantity of training data, ridata, determines the ac- 
curacy of the resulting classification network. Naturally, the 
network accuracy increases with ridata, but it typically sat- 
urates after a given number. We found that the quantity 
of data required saturated at roughly ridata ~ 10000 (see 
Fig. 



4.3 Training evolution 

Figure [3] illustrates the training evolution for the classifica- 
tion network with rihid = and n c i ass = 9. In the top two 
panels we plot the true positive rates (TPR) of the network 
on the training set and the test set, for the HW and SHMW 
respectively; in each plot, the TPR on the training set has 
been mutliplied by a factor less than unity to highlight the 
divergence with the TPR for the test set. We see that this 
divergence occurs after ~ 100 and ~ 500 iterations of the 
MemSys optimiser for the HW and SMHW, respectively. 
Thus the training was terminated at this point to construct 
our final classification networks. 

A key criterion in determining the quality of our classi- 
fiers is the dispersion of the /nl values obtained in the test- 



■ HW 

♦ SMHK 



Figure 2. Results of the dispersion of /nl for 1000 Gaussian 
simulations for different values of ridata- 



ing set. This is plotted in the bottom two panels of Figure [3] 
for the HW and SMHW, respectively. We note that, in each 
case, this dispersion increases noticeable beyond the point 
where the TPRs on the training and testing sets diverge. 



5 RESULTS 

5.1 Application to WMAP simulations 

We first applied our classifiers to 1000 WMAP-7yr simu- 
lations made from Gaussian CMB maps (/nl = 0). For 
the HW classifier, we obtained (/nl) = —1, which indi- 
cates the estimator is essentially unbiassed. Moreover, the 
dispersion of the estimator ct(/nl) = 33 is extremely similar 
to that obtained with the weighted least-squares method 
(ct(/jvl) = 34). The full distribution of the estimator is 
shown in the top panel of Fig. [4] For the SMHW classifier, we 
again found (/nl) = —1, with a dispersion of <t(/nl) = 22, 
which is very close to the optimal value of a (/nl) = 21. The 
distribution of the estimator for the SMHW is shown in the 
bottom panel of Fig. [4] 

The histogram bins in Fig. 2] have the same size and 
central values as those used to define the network classes. 
We see that the classes at extremal /nl values are empty, 
indicating that the network placed no maps in these /nl 
ranges. Thus for estimating /nl from Gaussian or nearly 
Gaussian maps the range in /nl used is sufficiently wide. 

We next applied our estimator to sets of non-Gaussian 
simulations, each with a different non-zero /nl value. For 
each true /nl value, we analysed the corresponding WMAP 
simulations and calculated the mean and dispersion of our 
estimator /nl for both the HW and SMHW classifiers. The 
results are shown in fig. [5] in which we plot the mean value 
of /nl against the true /nl value. We see that the classifiers 
are unbiassed for |/nl| < c with an almost constant disper- 
sion. For larger |/nl| values, however, the estimator becomes 
progressively more biassed and its dispersion decreases. 

The latter behaviour is simply understood as an edge 
effect due to the finite total range of /nl considered by the 
networks. This point is illustrated in Fig. [5] in which we 
plot the full distributions of /nl obtained for a number of 
representative values of the true /nl- We see that for |/nl| < 
a, we obtain close to symmetric distribution centred on the 
true /nl value, with no maps being placed in the extreme 
classes. As |/nl| > o", however, we see that the classifier 
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Figure 3. Evolution of the true positive rate for each iteration of the training process with a neural network with Hhid = and 
^data = 10000. Note that the TPR of the training set have been multiplied by a factor less than unity in order to highlight the divergence 
of the behaviours. The bottom panels show the variation of the dispersion on the estimate /nl during the training. Left panels for HW 
and right panels for SMHW. 



does begin to place maps in the extreme classes, resulting 
in the distribution of /nl becoming severely skewed and no 
longer centred on the true value. Of course, if one were to 
encounter this behaviour in the analysis of a real data set, 
one could simply alter the range of /nl considered by the 
network and retrain. 

In any case, the above results show that both the HW 
and SMHW network classifiers produce unbiassed estimates 
/nl provided —a < }nl < cr. Moreover, the dispersions 
on these estimators are very similar to those obtained with 
the classical weighted least squares (WLS) method, indi- 
cating that neural networks can produce very accurate re- 
sults within the limitations described above. In the case of 
the SMHW, this is a particularly important result since the 
complexity of the covariance matrix inversion required in 
the standard app roach is bypassed vi a the use of the neural 
network classifier. ICurto et alj l|2011aT ) used a principal com- 
ponent analysis to reduce the covariance matrix dimension 
to allow inversion. 



Figure 4. Distribution of /nl obtained from 1000 Gaussian re- 
alizations for HW (top) and SMHW (bottom). 



5.2 Application to WMAP 7-year data 

Applying the neural network classifiers to real data (the 
V+W WMAP 7-year data map), we obtain /nl = -12 
for the HW and /nl = 19 for the SMHW. Both these val- 
ues lie well within the corresponding dispersion of the esti- 
mator. From the corresponding /nl distributions obtained 
on simulated data, we find that 95% confidence limits are 
-78 < /nl < 51 for the HW and -24 < /nl < 61 for the 
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Table 1. Results obtained with neural networks (NN) and weighted least squares (WLS). /NL,data is the best fitting value for V+W 
WMAP data, (/nl, gauss) and c(/nl) are the expected value and the standard deviation for Gaussian simulations. P2.5 and P97.5 
represent the percentile values at 95% confidence level of /nl for Gaussian realizations. 
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Figure 6. Distribution of /nl obtained from 200 non-Gaussian realizations with representative true /nl values, for HW (left) and 
SMHW (right). 



SMHW0 We therefore conclude that the data are consistent 
with the Gaussian hypothesis. We note that the SMHW con- 
fidence limits are very s imilar to those obtained with the 
optim al /nl estimator (Ko matsu et al.l l201.lt ISmith et all 
120091) . 

These results are summarised in Table [1] along with 
found via the weighted least squares (WLS) method. The 
latter results are also consistent with Gaussianity. It is worth 
mentioning, however, the different values of /nl obtained by 
the neural network and the WLS methods, for both HW and 
SMHW. Although all four values lie well within their cor- 
responding dispersions, each method returns a different /nl 
value when applied to the same WMAP-7yr dataset. This 
behaviour is to be expected, however, since these are four 
different estimators of /nl- Therefore, in general, they will 
not be equal, even when applied to the same input data. 
Only the statistical properties (e.g. mean, dispersion) of 
their sampling distributions are important. 

2 Note that the constraints are not corrected for the unresolved 
point sources contribution. 



6 CONCLUSIONS 

We have trained a multi-class neural network classifier with 
third-order moments of the HW and SMHW coefficients 
of non-Gaussian realizations in order to set constraints on 
the local non- linear coupling term /ml using WMAP 7- year 
data. We found that with a very simple network and with 
few iterations (requiring just a few sees CPU time) it is pos- 
sible to produce the same results as those obtained with 
the weighted least squares method. This is an interesting 
achievement, as it bypasses any covariance matrix related 
computations and their associated regularisation problems. 
The estimation of the covariance matrix with both wavelets 
requires the analysis of at least 10000 Gaussian simulations 
which involves a huge demand in CPU time, in particular 
with the SMHW statisitcs. The error bars on the estimation 
of /nl computed with Gaussian simulations are ct(/jvl = 33) 
for HW and a(f NL ) = 22 for SMHW, which are extre mely 
similar to the ones obtained in lCasaponsa et alj l|2010l ) and 
ICurto et all i|2011aD using the same statisitcs but a differ- 
ent estimator based on the weighted least squares method 
(cr = 34, ct = 21 for HW and SMHW respectively). The 
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/nl 



Figure 5. The mean and dispersion of /nl obtained for a number 
of representative values of the true /nl for the HW network (top) 
and the SMHW network (bottom). 



constraints for WMAP 7-year data were found to be —78 < 
/nl < 51 for the HW and -24 < /nl < 61 for the SMHW, 
w hich are compatible to a Gaussian distr i bution as found 
Smith et all ll2009fi; ICurto et all l|2009bl ); iKomatsu et all 



l|201lf ): ICasaponsa et al.l (|20ld ) and Curto et al. (2011b). 
The results obtaine d with the SMHW stati stics are similar 
to the ones found in lSmith et al.l l|2009l ) and lKomatsu et al.l 
(2011), which are the most stringent ones currently available 
at the limit of the WMAP resolution. Further analysis, as to 
the contribution to /nl of unresolved point sources or fore- 
grounds can be performed by applying the linear classifier to 
the statistics of new simulated maps with this characteristic 
signal. 



ACKNOWLEDGMENTS 

We acknowledge partial financial support from the Spanish 
Ministerio de Ciencia e Innovation project AYA2010-21766- 
C03-01 and from the CSIC-The Royal Society joint project 
with reference 2008GB0012. B. Casaponsa thanks the Span- 
ish Ministerio de Ciencia e Innovation for a pre-doctoral fel- 
lowship. The authors acknowledge the computer resources, 
technical expertise and assistance provided by the Span- 
ish Supercomputing Network (RES) node at Universidad de 
Cantabria. We acknowledge the use of Legacy Archive for 
Microwave Background Data Analysis (LAMBDA) . Support 
for it is provided by the NASA Office of Space Science. The 
HealPix package was used throughout the data analysis 
|G6rski et al.ll2005l ). 



REFERENCES 

Antoine J.-P., Vandergheynst P., 1998, Journal of Mathe- 
matical Physics, 39, 3987 



Auld T., Bridges M., Hobson M. P., Gull S. F., 2007, MN- 
RAS, 376, Lll 

Babich D., Creminelli P., Zaldarriaga M., 2004, Journal of 

Cosmology and Astro-Particle Physics, 8, 9 
Baccigalupi O, Bedini L., Burigana O, De Zotti G., Farusi 

A. , Maino D., Maris M., Perrotta F., Salerno E., Toffolatti 
L., Tonazzini A., 2000, MNRAS, 318, 769 

Barreiro R. B., Hobson M. P., Lasenby A. N., Banday A. J., 
Gorski K. M., Hinshaw G., 2000, MNRAS, 318, 475 

Bartolo N., Komatsu E., Matarrese S., Riotto A., 2004, 
Phys.Rev.D, 402, 103 

Bucher M., van Tent B., Carvalho C. S., 2010, MNRAS, 
407, 2193 

Carballo R., Gonzalez-Serrano J. I., Benn C. R., Jimenez- 

Lujan F., 2008, MNRAS, 391, 369 
Casaponsa B., Barreiro R. B., Curto A., Martinez- Gonzalez 

E. , Vielva P., 2010, ArXiv e-prints 

Cayon L., Martinez- Gonzalez E., Argiieso F., Banday A. J., 

Gorski K. M., 2003, MNRAS, 339, 1189 
Cayon L., Sanz J. L., Martinez-Gonzalez E., Banday A. J., 

Argueso F., Gallegos J. E., Gorski K. M., Hinshaw G., 

2001, MNRAS, 326, 1243 
Cruz M., Martmez-Gonzalez E., Vielva P., Cayon L., 2005, 

MNRAS, 356, 29 
Curto A., Martmez-Gonzalez E., Barreiro R. B., 2009b, 

ApJ, 706, 399 

Curto A., Martinez- Gonzalez E., Barreiro R. B., 2011a, 
MNRAS, 412, 1023 

Curto A., Martinez- Gonzalez E., Barreiro R. B., Hobson 
M. P., 2011b, submitted to MNRAS 

Curto A., Martmez-Gonzalez E., Mukherjee P., Barreiro 
R. B., Hansen F. K., Liguori M., Matarrese S., 2009a, 
MNRAS, 393, 615 

Eisner F., Wandelt B. D., 2009, ApJS, 184, 264 

Eisner F., Wandelt B. D., 2010, ApJ, 724, 1262 

Fergusson J. R., Liguori M., Shellard E. P. S., 2010, Phys- 
ical Review D, 82, 023502 

Fergusson J. R., Shellard E. P. S., 2009, Phys. Rev. D, 80, 
043510 

Gangui A., Lucchin F., Matarrese S., Mollerach S., 1994, 
ApJ, 430, 447 

Gorski K. M., Hivon E., Banday A. J., Wandelt B. D., 
Hansen F. K., Reinecke M., Bartelmann M., 2005, ApJ, 
622, 759 

Gull S. F., Skilling J., 1999, Quantified maximum entropy: 
Mem-Sys 5 users manual. Maximum Entropy Data Con- 
sultants Ltd, Royston 

Hikage O, Matsubara T., Coles P., Liguori M., Hansen 

F. K., Matarrese S., 2008, MNRAS, 389, 1439 
Hobson M., Lasenby A., 1998, 298, 905 

Jaynes E., 2003, Probability Theory: The Logic of Science. 

Cambridge University Press 
Komatsu E., Smith K. M., Dunkley J., Bennett C. L., Gold 

B. , Hinshaw G., Jarosik N., Larson D., Nolta M. R., Page 
L., Spergel D. N., Halpern M. an Wright E. L., 2011, 
ApJS, 192, 18 

Komatsu E., Spergel D., 2001, Phys.Rev.D, 63, 063002 
Komatsu E., Spergel D. N., Wandelt B. D., 2005, ApJ, 634, 
14 

Leshno M., V. Y., A. P., Schocken S., 1993, Neural Netw., 
6, 861 

MacKay D., 2003, Information Theory, Inference and 



© 0000 RAS, MNRAS 000, 000-000 



WMAP 7yr local /nl constraints using neural networks 9 



Learning Algorithms. Cambridge University Press 
Martinez- Gonzalez E., Gallegos J. E., Argueso F., Cayon 

L., Sanz J. L., 2002, MNRAS, 336, 22 
McEwen J. D., Hobson M. P., Lasenby A. N., Mortlock 

D. J., 2008, MNRAS, 388, 659 

McEwen J. D., Vielva P., Wiaux Y., Barreiro R. B., Cayon 
L., Hobson M. P., Lasenby A. N., Martinez-Gonzalez E., 
Sanz J. L., 2007, Journal of Fourier Analysis and Appli- 
cations, 13, 495 

Mukherjee P., Wang Y., 2004, ApJ, 613, 51 

Natoli P., de Troia G., Hikage C, Komatsu E., Migliac- 
cio M., Ade P. A. R., Bock J. J., Bond J. R., Borrill J., 
Boscaleri A., Contaldi C. R., Crill B. P., de Bernardis P., 
de Gasperis G., de Oliveira-Costa A., di Stefano G., Hivon 

E. , 2010, MNRAS, 408, 1658 

Pietrobon D., 2010, Memorie della Societa Astronomica 
Italiana Supplementi, 14, 278 

Salopek D. S., Bond J. R., 1990, Phys.Rev.D, 42, 3936 

Shahram M., Donoho D., Starck J., 2007, in Society of 
Photo-Optical Instrumentation Engineers (SPIE) Confer- 
ence Series Vol. 6701 of Society of Photo-Optical Instru- 
mentation Engineers (SPIE) Conference Series, Multiscale 
representation for data on the sphere and applications to 
geopotential data 

Smith K. M., Senatore L., Zaldarriaga M., 2009, Journal 
of Cosmology and Astro-Particle Physics, 9, 6 

Storrie-Lombardi M. C, Lahav O., Sodre Jr. L., Storrie- 
Lombardi L. J., 1992, MNRAS, 259, 8P 

Tenorio L., Jaffe A. H., Hanany S., Lineweaver C. H., 1999, 
MNRAS, 310, 823 

Vanzella E., Cristiani S., Fontana A., Nonino M., Arnouts 
S., Giallongo E., Grazian A., Fasano G., Popesso P., 
Saracco P., Zaggia S., 2004, A&A, 423, 761 

Verde L., Wang L., Heavens A. F., Kamionkowski M., 2000, 
MNRAS, 313, 141 

Vielva P., Martinez-Gonzalez E., Barreiro R. B., Sanz J. L., 
Cayon L., 2004, ApJ, 609, 22 

Vielva P., Sanz J. L., 2010, MNRAS, 404, 895 

Yadav A. P. S., Wandelt B. D., 2010, Advances in Astron- 
omy, 2010 



© 0000 RAS, MNRAS 000, 000-000 



